PO boil end-to-end + robust switch cap (solves 5/5 across 5 seeds) #37
Merged
Delegate option execution to option_model.get_next_state_and_num_actions instead of duplicating its termination logic (stuck detection, Wait atom-change checks) and directly accessing its simulator.
…inement Extract the duplicated backtracking loop from run_low_level_search (SeSamE) and _refine_sketch (agent bilevel) into a single run_backtracking_refinement function in planning.py. Both callers now delegate to it with their own sample_fn and validate_fn callbacks, eliminating ~80 lines of duplicated loop/backtracking logic.
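The shared loop described above can be pictured as a depth-first search with a per-step sample budget. The following is only a minimal sketch under assumptions: the real run_backtracking_refinement signature is not shown in this PR text, so the argument order, the return value, and the convention that validate_fn returns the next state (or None on rejection) are all hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence


def run_backtracking_refinement(
    init_state: Any,
    max_tries: Sequence[int],
    sample_fn: Callable[[int, Any], Any],
    validate_fn: Callable[[int, Any, Any], Optional[Any]],
) -> Optional[List[Any]]:
    """Depth-first refinement with a sample budget per plan step.

    Hypothetical contract: sample_fn(idx, state) proposes a step;
    validate_fn(idx, state, step) returns the next state or None.
    """
    n = len(max_tries)
    plan: List[Any] = [None] * n
    states: List[Any] = [init_state] * (n + 1)
    tries = [0] * n
    idx = 0
    while idx < n:
        if tries[idx] >= max_tries[idx]:
            # Budget exhausted at this step: reset it and back up one step.
            tries[idx] = 0
            idx -= 1
            if idx < 0:
                return None  # nothing left to backtrack to
            continue
        tries[idx] += 1
        step = sample_fn(idx, states[idx])
        next_state = validate_fn(idx, states[idx], step)
        if next_state is not None:
            plan[idx] = step
            states[idx + 1] = next_state
            idx += 1
    return plan


# Toy usage: step 1 only validates if the running sum reaches 3.
draws = {0: iter([1, 2]), 1: iter([5, 1, 9, 1])}
plan = run_backtracking_refinement(
    0,
    [2, 2],
    lambda idx, state: next(draws[idx]),
    lambda idx, state, step: state + step
    if (idx == 0 or state + step == 3) else None,
)
```

In the toy run, the first choice at step 0 leads to a dead end at step 1, so the search backs up, resamples step 0, and then succeeds. A caller that already has grounded options can pass a budget of one try per step and a sample_fn that simply returns them.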
Replace 60 lines of manual option-model execution with a call to run_backtracking_refinement using max_tries=[1] and a sample_fn that returns the pre-grounded options. Remove unused Any import.
Move the _current_observation assignment into _reset_state so callers don't need to remember the two-step pattern. Clarify the relationship between _current_observation (backing field) and _current_state (typed read accessor) in docstrings and comments.
Adds agent_bilevel_plan_sketch_file setting that, when set to a file path, loads the plan sketch directly from that file, bypassing the foundation model query. Includes test data files and a unit test.
Extract repeated wait-termination check into _check_wait_termination helper and unify the three _terminal branches into a single definition with config checks inside the function body.
- Remove dead/commented-out code and stale self-question comments
- Add _VIRTUAL_OBJECT_TYPES constant to replace hardcoded type-name skip lists in _set_state and _get_state
- Move env-specific _get_robot_state_dict branches to subclass overrides in pybullet_cover and pybullet_blocks
- Extract _get_camera_matrices helper to deduplicate render methods
- Extract _get_object_state_dict from _get_state for per-object logic
- Move create_pybullet_block/sphere to pybullet_helpers/objects.py
- Merge _create_task_specific_objects into _set_domain_specific_state
- Rename: _reset_state -> _set_state, _reset_custom_env_state -> _set_domain_specific_state, _extract_feature -> _get_domain_specific_feature
- Add docstrings explaining where each method is called from
Reorganize methods into labeled sections (Setup, Public API, Core Loop, State Write/Read, Grasp Management, Action Helpers, Rendering, Utilities) so related functions are adjacent. Update module docstring to document the main public API and state synchronization methods.
Add _step_base() and _domain_specific_step() to PyBulletEnv base class. step() now calls _step_base (robot control, physics, grasp) then _domain_specific_step (water filling, heating, etc.), gated by _skip_domain_specific_dynamics flag for kinematics-only mode. Migrate all 15 domain envs to override _domain_specific_step() instead of step(). Envs with pre-step logic (coffee, switch, blocks, cover) still override step() for the pre-step part only.
Document the step_base → domain_specific_step → get_observation flow, _skip_domain_specific_dynamics flag, and _domain_specific_step as an optional override.
Replace direct access to private _skip_domain_specific_dynamics attribute with a public constructor parameter, so callers declare kinematics-only mode at creation time instead of mutating internal state after construction.
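The step flow described in these commits is a template-method pattern. A minimal sketch follows, with stand-in classes rather than the real PyBulletEnv; the constructor parameter name and the logging list are assumptions made for illustration.

```python
from typing import Any, List, Tuple


class BaseEnv:
    """Stand-in for the shared base class; not the real PyBulletEnv."""

    def __init__(self, skip_domain_specific_dynamics: bool = False) -> None:
        # Kinematics-only mode is declared at construction time
        # (hypothetical parameter name).
        self._skip_domain_specific_dynamics = skip_domain_specific_dynamics
        self.log: List[Tuple[Any, ...]] = []

    def step(self, action: Any) -> List[Tuple[Any, ...]]:
        self._step_base(action)  # robot control, physics, grasp
        if not self._skip_domain_specific_dynamics:
            self._domain_specific_step()  # e.g. water filling, heating
        return self._get_observation()

    def _step_base(self, action: Any) -> None:
        self.log.append(("base", action))

    def _domain_specific_step(self) -> None:
        """Optional override; the base class has no process dynamics."""

    def _get_observation(self) -> List[Tuple[Any, ...]]:
        return list(self.log)


class HeatingEnv(BaseEnv):
    """Domain env that overrides only the process-dynamics hook."""

    def _domain_specific_step(self) -> None:
        self.log.append(("heat",))


full = HeatingEnv().step("push")
kinematic_only = HeatingEnv(skip_domain_specific_dynamics=True).step("push")
```

Here `full` records both the base step and the heating step, while `kinematic_only` records the base step alone.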
…ging Both AgentSessionMixin and AgentExplorer had near-identical wrappers that ran session.query() synchronously via nest_asyncio or asyncio.run. Move that logic into a module-level run_query_sync helper in session_manager and have both callers delegate to it.
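A module-level helper of this kind might look roughly like the sketch below. It is an assumption-laden illustration: the real run_query_sync uses nest_asyncio when a loop is already running, whereas this version swaps in a worker thread so that it depends only on the standard library, and fake_query is a hypothetical stand-in for session.query().

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine


def run_query_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running here, so asyncio.run would raise.
    # Stand-in for nest_asyncio: run the coroutine on a fresh loop
    # in a worker thread and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fake_query(x: int) -> int:
    """Hypothetical stand-in for an agent session query."""
    await asyncio.sleep(0)
    return x * 2


async def caller_inside_loop() -> int:
    # Synchronous call made while an event loop is already running.
    return run_query_sync(fake_query(1))


outside = run_query_sync(fake_query(21))
inside = asyncio.run(caller_inside_loop())
```

Keeping the loop-detection logic in one helper means both callers behave the same whether or not they are invoked from within a running event loop.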
…y and maintainability
Distinguishes the grounded-plan explorer from upcoming bilevel variants. AgentExplorer -> AgentPlanExplorer, get_name() 'agent' -> 'agent_plan', file moved to agent_plan_explorer.py, and all callers / docstrings / YAML config examples updated accordingly.
The mixin is pure agent-session plumbing (session creation, lifecycle, explorer factory) and has no approach-specific logic, so it belongs next to session_manager.py, tools.py, and the sandbox managers rather than in approaches/.
The explorer asks a Claude agent for a plan sketch, refines it against the approach's current (possibly learned) option model, and rolls the refined plan out in the real env. When the mental model disagrees with reality (e.g. the sketch expects JugFilled after a Wait but the mental model's process dynamics can't produce it), the explorer truncates the plan at the deepest unsatisfiable subgoal (inclusive) so the real-env rollout ends exactly where the disagreement occurs, maximising signal per experiment.
Key pieces:
- predicators/agent_sdk/bilevel_sketch.py: extracted the sketch build / parse / refine helpers from AgentBilevelApproach as module-level functions so both the approach (solve path) and the new explorer (exploration path) can share them. refine_sketch gains truncate_on_subgoal_fail: the on_step_fail callback snapshots the deepest subgoal failure seen during backtracking, and on exhaustion the captured prefix is returned as the experiment plan.
- predicators/explorers/agent_bilevel_explorer.py: new explorer. Reads option_model from tool_context (synced by the approach), builds the sketch prompt via bilevel_sketch, runs refine_sketch with check_subgoals=True, check_final_goal=False, truncate_on_subgoal_fail=True, wraps the result in an option_plan_to_policy that converts OptionExecutionFailure into RequestActPolicyFailure so the episode cleanly terminates at the point of real-env divergence. Stashes the sketch subgoals/options on ToolContext for downstream diffing by the learning approach.
- predicators/approaches/agent_bilevel_approach.py: shim methods over bilevel_sketch; behaviour unchanged.
- predicators/approaches/agent_planner_approach.py: _create_explorer dispatches both "agent_plan" and "agent_bilevel" through the agent factory path and forwards CFG.explorer as the name.
- predicators/explorers/__init__.py: factory branch merged for the two agent-session-backed explorers.
- predicators/agent_sdk/tools.py: ToolContext gains last_sketch_subgoals / last_sketch_options fields, populated by the explorer and marked TODO for the learning approach to consume.
- tests/explorers/test_agent_bilevel_explorer.py: happy-path, fallback, wait-memory-injection, and deepest-subgoal-failure truncation tests.
- New setting agent_bilevel_explorer_max_samples_per_step (default 50), separate from the solve-path budget, so the explorer's backtracking cost is independently tunable.
- Log the actual experiment plan (option names, objects, params) after refinement so the explorer's output is visible alongside the existing sketch/truncation log lines.
- Test config updated to set both budgets explicitly.
AgentSimLearningApproach extends AgentBilevelApproach to learn process dynamics online. Each cycle: the agent synthesizes parameterized process rules via Claude (using run_python / evaluate_simulator / test_simulator MCP tools), parameters are fitted via emcee MCMC, and the learned dynamics are composed with a kinematics-only PyBullet oracle into a combined option model for plan refinement.
Key pieces:
- predicators/approaches/agent_sim_learning_approach.py: the approach. Initialises with a kinematics-only option model (so AgentBilevelExplorer sees disagreements at process-dynamic subgoals like JugFilled/Boiled), and replaces it with the kin+learned model after each successful synthesis cycle.
- predicators/agent_sdk/tools.py: create_synthesis_tools() builds the three MCP tools the synthesis agent uses; extra_mcp_tools field and get_allowed_tool_list(extra_names=) plumbing lets the approach inject them into the session.
- predicators/code_sim_learning/: ParamSpec, fit_params (emcee MCMC), compute_mse, LearnedSimulator.
- predicators/ground_truth_models/boil/gt_simulator.py: ground-truth process-dynamics simulator for the boil environment.
- tests/: approach and param-fitting tests.
- agents.yaml: comment out agent_bilevel preset, add agent_sim_learning with explorer=agent_bilevel and skip_test_until_last_ite_or_early_stopping.
- common.yaml: disable failure/test video recording, set num_online_learning_cycles=1 for faster iteration.
Simulation primitives (code_sim_learning/utils.py):
- apply_rules(state, rules, params) → ProcessUpdate
- merge_updates(base_state, updates, process_features) → State
- simulate_step(state, action, base_env, rules, params, features) → State
These replace _build_fitted_step_fn, merge_process_updates, _sim_fn_from_rules, and the body of _build_combined_simulator.
GT simulator factory (ground_truth_models):
- GroundTruthSimulatorFactory ABC + get_gt_simulator(env_name) discovery, following the existing get_gt_options / get_gt_nsrts pattern.
- PyBulletBoilGroundTruthSimulatorFactory registered in boil/.
- Replaces the hardcoded _load_oracle_simulator in the approach.
Oracle ablation flags (settings.py):
- agent_sim_learn_oracle_sim_program: load GT rules, skip synthesis.
- agent_sim_learn_oracle_sim_params: use GT param values, skip MCMC.
Also: kin_env → base_env rename throughout, redundant self._types assignment removed, process_features computed once in __init__.
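How the three primitives compose can be sketched with plain dictionaries. This is a toy illustration, not the code in code_sim_learning/utils.py: the state representation, the two-argument rule signature, the base_step callable standing in for base_env, and the choice to evaluate rules on the pre-step state are all assumptions.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Key = Tuple[str, str]  # (object name, feature name)
State = Dict[Key, float]
Params = Dict[str, float]
Rule = Callable[[State, Params], State]  # hypothetical rule signature


def apply_rules(state: State, rules: Iterable[Rule], params: Params) -> State:
    """Collect the partial updates proposed by every rule."""
    update: State = {}
    for rule in rules:
        update.update(rule(state, params))
    return update


def merge_updates(base_state: State, updates: State,
                  process_features: Set[str]) -> State:
    """Overlay rule outputs on the base state, process features only."""
    merged = dict(base_state)
    for (obj, feat), value in updates.items():
        if feat in process_features:
            merged[(obj, feat)] = value
    return merged


def simulate_step(state: State, action: float,
                  base_step: Callable[[State, float], State],
                  rules: Iterable[Rule], params: Params,
                  features: Set[str]) -> State:
    """Kinematic step from the base env, then learned process dynamics."""
    base_next = base_step(state, action)
    return merge_updates(base_next, apply_rules(state, rules, params),
                         features)


def fill_rule(state: State, params: Params) -> State:
    level = state[("jug", "water")]
    if state[("faucet", "on")] > 0.5:
        level = min(1.0, level + params["fill_rate"])
    # The robot entry is not a process feature, so the merge drops it.
    return {("jug", "water"): level, ("robot", "x"): -1.0}


def move_robot(state: State, action: float) -> State:
    return {**state, ("robot", "x"): state[("robot", "x")] + action}


s0 = {("jug", "water"): 0.0, ("faucet", "on"): 1.0, ("robot", "x"): 0.0}
s1 = simulate_step(s0, 0.5, move_robot, [fill_rule],
                   {"fill_rate": 0.25}, {"water"})
```

Restricting the merge to declared process features keeps the base env as the sole owner of kinematic features, even if a rule emits values for them.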
- yapf + isort autoformatting applied to all touched files.
- pylint: fix logging-not-lazy in agent_bilevel_explorer, add broad-except and reimported disables in agent_sim_learning_approach.
- mypy: fix base/env variable name collision, add type: ignore on lambda inference, add return type annotations to GT factory methods.
Use utils.abstract to evaluate expected atoms in low-level search so that DerivedPredicates — which require a Set[GroundAtom] rather than a State — are handled correctly alongside regular predicates.
When sequential simulate calls differ only in process features (as in the combined kinematic+learned simulator), reapplying joint positions and tearing down/recreating grasp constraints causes visible arm jitter. Compare robot poses first and skip the kinematic reset path when they already match.
Factor simulator synthesis into a shared _learn_simulator helper so that both learn_from_offline_dataset and learn_from_interaction_results can trigger it on their respective trajectory sources. Also create a separate headless env for parameter fitting so MCMC's thousands of _set_state calls don't thrash the GUI env during training.
Add the cross-cutting CFG.partially_observable flag. In PO mode the jug type drops heat_level so the agent never sees the latent's name; heat is kept internally (state.privileged plus the jug.heat_level sim attribute), WaterBoiled reads the derived observable bubbling_level, and the heating/state-reset paths route off the observable array. Fully-observable mode is unchanged.
Partial-observability variant of agent_sim_predicate_invention: synthesized rules carry a latent block across steps and may declare LATENT_INIT, read from the simulator file. The parent loader now execs that file once and returns its namespace, so LATENT_INIT loads without a second exec; also guards the oracle-sim-program path as incompatible with partial observability.
gt_simulator_po.py is the answer-key for the heat-hidden boil env: it carries the hidden per-jug heat in a recurrent latent block and surfaces it only as the observable bubbling_level (the env's monotone ramp), never touching the heat_level feature that is absent in PO mode. Gates are hard (no soft thresholds) since the recurrent fit is gradient-free. Both boil GT-simulator factories now gate get_env_names on CFG.partially_observable, so get_gt_simulator dispatches to exactly one module per run: the PO simulator under partial observability, the fully-observable gt_simulator.py otherwise.
…roach The latent mechanism is orthogonal to predicate invention, so it moves from AgentSimRecurrentPredicateInventionApproach down into the base AgentSimLearningApproach, auto-activated by rule signature (has_latent_rules). Fully-observable simulators (3-arg rules) take the existing non-latent paths unchanged; partially-observable ones (5-arg rules) thread a latent block through fitting, the combined simulator, and the oracle-param SSE diagnostic. This lets the base approach (which keeps all ground-truth predicates, no invention) load and solve with the PO GT simulator: the oracle-program path no longer asserts against partial observability. The recurrent predicate-invention approach slims to just its synthesis prompt, inheriting every latent mechanic from the base.
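Dispatch by rule signature can be illustrated with inspect.signature. The sketch below is hypothetical: the real has_latent_rules may be implemented differently, and the parameter names in the two example rules are invented, since the commit text only states that fully-observable rules take 3 arguments and partially-observable ones take 5.

```python
import inspect
from typing import Callable, Iterable


def has_latent_rules(rules: Iterable[Callable]) -> bool:
    """Return True if every rule uses the 5-arg (latent-threading) form."""
    arities = {len(inspect.signature(rule).parameters) for rule in rules}
    if arities <= {3}:  # includes the empty rule set
        return False
    if arities == {5}:
        return True
    raise ValueError(f"mixed or unsupported rule arities: {sorted(arities)}")


def fo_rule(state, action, params):
    """Fully-observable form (argument names are assumptions)."""
    return {}


def po_rule(state, action, params, latent, latent_init):
    """Recurrent form that threads a latent block (names are assumptions)."""
    return {}, latent


fo_is_latent = has_latent_rules([fo_rule])
po_is_latent = has_latent_rules([po_rule])
```

Rejecting a mixed set up front is one way to avoid the "takes 3 positional arguments but 5 were given" style of failure at call time.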
agent_po_gt_sim runs the base agent_sim_learning approach (keeps all ground-truth predicates) with the PO GT simulator loaded as the oracle program and oracle params, on the heat-hidden boil env. A fixed plan sketch and zero online cycles mean no LLM is queried, so it is a fast, deterministic end-to-end check. The LLM-driven agent_predicate_invention block is commented out so the launcher targets only this test.
boil/__init__.py imported only the fully-observable simulator factory, so get_gt_simulator (which discovers GroundTruthSimulatorFactory subclasses via get_all_subclasses) never saw PyBulletBoilPOGroundTruthSimulatorFactory and raised NotImplementedError for pybullet_boil under partially_observable. Import the PO factory and add it to __all__ so the PO oracle simulator is discoverable.
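The failure mode is a general property of subclass-based discovery: a class that is never imported is never defined, so it never shows up in `__subclasses__()`. A small self-contained sketch follows, with toy factory classes and a module-level flag standing in for CFG.partially_observable; the lookup function is a simplified stand-in for get_gt_simulator.

```python
from abc import ABC, abstractmethod
from typing import Set

PARTIALLY_OBSERVABLE = True  # stand-in for CFG.partially_observable


class GroundTruthSimulatorFactory(ABC):
    @classmethod
    @abstractmethod
    def get_env_names(cls) -> Set[str]:
        """Env names this factory serves under the current config."""


def get_all_subclasses(cls: type) -> Set[type]:
    found: Set[type] = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= get_all_subclasses(sub)
    return found


def lookup_factory(env_name: str) -> type:
    for factory in get_all_subclasses(GroundTruthSimulatorFactory):
        if env_name in factory.get_env_names():
            return factory
    raise NotImplementedError(f"no GT simulator for {env_name}")


class BoilFactory(GroundTruthSimulatorFactory):
    @classmethod
    def get_env_names(cls) -> Set[str]:
        return set() if PARTIALLY_OBSERVABLE else {"pybullet_boil"}


# Only the fully-observable factory is "imported" so far.
try:
    lookup_factory("pybullet_boil")
    found_before = True
except NotImplementedError:
    found_before = False


class BoilPOFactory(GroundTruthSimulatorFactory):
    """Defining the class plays the role of importing its module."""

    @classmethod
    def get_env_names(cls) -> Set[str]:
        return {"pybullet_boil"} if PARTIALLY_OBSERVABLE else set()


found_after = lookup_factory("pybullet_boil")
```

Because both toy factories gate their env names on the same flag, exactly one of them answers for the env in any given run.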
The strict raise on a reconstruction mismatch was gated on whether an env overrode _get_state() -- a leaky proxy for 'has an exact state<->sim mapping'. An env may override _get_state() for a non-kinematic reason (e.g. boil attaching a hidden-heat privileged block) without making its robot reconstruction any less lossy than the base env's, which spuriously promoted benign ~0.02 rad IK round-trip noise into a fatal ValueError. Replace the proxy with an explicit _strict_set_state_reconstruction ClassVar defaulting to False (warn). pybullet_blocks, whose State<->sim mapping is exact, opts into True. Behavior is unchanged for every existing env (blocks raises as before; all others warn as before).
- training.py: blank line after a nested import block (isort 5.10.1).
- structs.py: suppress arguments-differ on DerivedPredicate.holds and ConceptPredicate.holds, which intentionally keep the legacy 3-arg signature (base Predicate.holds gained a latent param); they already suppress the mypy override error.
- pybullet_boil.py: h != h -> np.isnan(h) (comparison-with-itself) and iterate init_dict via .items() (consider-using-dict-items).
The _set_state reconstruction guard used a per-env boolean (_strict_set_state_reconstruction) to decide whether a State<->sim round-trip mismatch should raise or merely warn. That required each env to assert "my mapping is exact", which is brittle: pybullet_fan, for instance, stores fan positions symbolically and places the bodies by side, so a valid State legitimately round-trips with ~0.35 m of benign position disagreement -- not an angle, so it wasn't covered by the existing IK-noise rationale either. Replace the flag with two universal magnitude thresholds on PyBulletEnv: warn above _reconstruction_warn_atol (1e-3, unchanged behavior) and raise above _reconstruction_raise_atol (2.0). Benign reconstruction error is workspace-scale at most (~0.8 m worst case by fan geometry, well under 2.0), while an impossible or corrupt requested feature (e.g. held=-10000, off by 1e4) is far above it -- so only the latter aborts, for every env, with no per-env opt-in. pybullet_blocks drops the flag and uses the base defaults; its held=-10000 reset test still raises as before.
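The two-threshold policy can be sketched as a single comparison pass. The function below is illustrative only: the threshold values are the ones quoted in the commit message, while the function name, the flat dictionary representation, and the return value are assumptions.

```python
import logging
from typing import Dict, List, Tuple

RECONSTRUCTION_WARN_ATOL = 1e-3  # value quoted in the commit message
RECONSTRUCTION_RAISE_ATOL = 2.0  # value quoted in the commit message


def check_reconstruction(
    requested: Dict[str, float],
    reconstructed: Dict[str, float],
    warn_atol: float = RECONSTRUCTION_WARN_ATOL,
    raise_atol: float = RECONSTRUCTION_RAISE_ATOL,
) -> List[Tuple[str, float]]:
    """Warn on small round-trip error, abort on implausibly large error."""
    mismatches: List[Tuple[str, float]] = []
    for name, want in requested.items():
        err = abs(want - reconstructed[name])
        if err > raise_atol:
            raise ValueError(
                f"Could not reconstruct {name}: off by {err:g}")
        if err > warn_atol:
            logging.warning("Could not reconstruct %s exactly (err=%g)",
                            name, err)
            mismatches.append((name, err))
    return mismatches


# Workspace-scale disagreement only warns.
benign = check_reconstruction({"fan.x": 1.0}, {"fan.x": 1.35})

# A corrupt requested feature aborts.
try:
    check_reconstruction({"block.held": -10000.0}, {"block.held": 0.0})
    aborted = False
except ValueError:
    aborted = True
```

The design relies on a wide gap between the largest benign error and the smallest corrupt one, so a single pair of magnitudes can serve every env.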
The master merge kept both sides of the conflict in code_sim_learning/utils.py, leaving two byte-identical definitions of iter_feature_residuals and tripping mypy's no-redef check. Drop the second copy.
…ng agent_po_predicate_invention settings
…ate_invention Renames the recurrent partial-observability predicate-invention approach file and its class (AgentSimRecurrentPredicateInventionApproach -> AgentPOSimPredicateInventionApproach), updating all references across settings, structs, agent_bilevel, utils, the predicatorv3 agents config, and tests.
The synthesis tools (evaluate_step_fit, report_residuals) scored rules
through the legacy per-transition path (apply_rules, 3 args), while the
fitting engine calls recurrent rules with 5 args (apply_rules_with_latent
via has_latent_rules dispatch). So when the agent wrote the correct
5-arg signature the tool rejected it and steered the agent to a broken
3-arg rule, which then crashed the engine ("takes 3 positional
arguments but 5 were given").
- Add rollout_predictions() and route both tools through has_latent_rules
dispatch: recurrent rules now score with the latent threaded per
trajectory via the shared _fit_parameters_latent / compute_sse_recurrent
path the engine uses. _snapshot_and_load now surfaces LATENT_INIT.
- Remove a duplicated synthesis-prompt block (bad-merge artifact that also
double-injected the recurrent section) and template the rule-signature
example: fully-observable keeps the 3-arg form, the PO subclass shows
only the recurrent 5-arg signature (no 3-arg references).
- Add tests for rollout_predictions and FO/PO prompt rendering.
The (roll, tilt, wrist) Euler triple jointly encodes a free SO(3) orientation, so an axis-by-axis state-reconstruction check is degenerate at gimbal lock (tilt=±π/2): equivalent gimbal branches report up to π of spurious per-axis error on the same physical orientation, which surfaced as noisy "Could not reconstruct state exactly" warnings on robot.roll / robot.wrist. Add _ORIENTATION_EULER_TRIPLES and _euler_orientation_angle (geodesic angle between unit quaternions) and compare the triple as a single rotation, excluding its axes from the per-axis pass. The residual now surfaces as one small <orientation> angle instead of misleading per-axis rows. Adds gimbal-lock tests.
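The gimbal-lock degeneracy is easy to reproduce numerically. The sketch below is written under assumptions: it uses the common roll-pitch-yaw convention (rotations about X, Y, Z, composed as Rz·Ry·Rx) and treats the middle angle as the tilt; the repository's _euler_orientation_angle may use a different convention or helper.

```python
import math
from typing import Tuple

Quat = Tuple[float, float, float, float]  # (x, y, z, w)


def euler_to_quat(roll: float, pitch: float, yaw: float) -> Quat:
    """Roll-pitch-yaw to unit quaternion (assumed Rz * Ry * Rx convention)."""
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    return (
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
        cr * cp * cy + sr * sp * sy,
    )


def geodesic_angle(q1: Quat, q2: Quat) -> float:
    """Rotation angle between two orientations, in [0, pi]."""
    # abs() handles the q / -q double cover; min() guards acos's domain.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


# Two Euler triples on different gimbal branches at pitch = pi/2.
a = (0.3, math.pi / 2, 0.0)
b = (0.0, math.pi / 2, -0.3)

per_axis_error = max(abs(x - y) for x, y in zip(a, b))
rotation_error = geodesic_angle(euler_to_quat(*a), euler_to_quat(*b))
```

At pitch = π/2 only the difference between roll and yaw affects the orientation under this convention, so the two triples describe the same rotation: the per-axis comparison reports 0.3 rad of error while the geodesic angle is zero up to floating-point noise.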
Large MCP tool results returned inline were truncated by the agent SDK and dumped to ~/.claude/projects/.../tool-results/ (outside the sandbox), then the agent was instructed to read that host path -- the one out-of-sandbox access observed in the boil predicate-invention runs.
- Add _make_spilling_text_result and route all three tool factories through it: results over ~30k chars now spill to <sandbox>/tool_outputs/ with a head/tail preview, so nothing is dumped outside the sandbox. inspect_* (create_mcp_tools) previously had no spill; run_python already did.
- Add _screen_text_for_sandbox_escape and a matching self-contained Bash screen in VALIDATE_SANDBOX_SCRIPT (matcher now includes Bash): reject absolute / .. paths resolving outside the sandbox and predicators-source introspection. run_python is screened in-tool (the file-path hook does not cover MCP tools); Bash is screened by the hook. Heuristic, not a hard boundary (subprocess/env/computed paths can still escape; OS isolation remains the real boundary).
Verified against all 64 historical tool calls in the logs: only the 3 seed3 leak reads are blocked, zero false positives on legitimate calls.
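Both ideas, spilling oversized results inside the sandbox and screening paths that resolve outside it, can be sketched in a few lines. The helper names, the preview format, and the default limits here are assumptions for illustration; they are not the repository's _make_spilling_text_result or _screen_text_for_sandbox_escape, and, like the real screen, the path check is a heuristic and not an isolation boundary.

```python
import tempfile
from pathlib import Path


def make_spilling_text_result(text: str, sandbox_dir: str, name: str,
                              max_chars: int = 30_000,
                              preview_chars: int = 200) -> str:
    """Return text inline if small, else spill it and return a preview."""
    if len(text) <= max_chars:
        return text
    out_dir = Path(sandbox_dir) / "tool_outputs"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.txt"
    out_path.write_text(text, encoding="utf-8")
    rel = out_path.relative_to(sandbox_dir)  # sandbox-relative path
    return (f"{text[:preview_chars]}\n"
            f"... [{len(text)} chars total; full output in {rel}] ...\n"
            f"{text[-preview_chars:]}")


def escapes_sandbox(path: str, sandbox_dir: str) -> bool:
    """True if the path resolves to somewhere outside the sandbox."""
    root = Path(sandbox_dir).resolve()
    # Joining with an absolute path discards root, which is what we want.
    target = (root / path).resolve()
    return target != root and root not in target.parents


with tempfile.TemporaryDirectory() as box:
    small = make_spilling_text_result("short", box, "small")
    preview = make_spilling_text_result("x" * 40_000, box, "big")
    spilled_len = len((Path(box) / "tool_outputs" / "big.txt").read_text(
        encoding="utf-8"))
    inside_ok = not escapes_sandbox("tool_outputs/big.txt", box)
    dotdot_blocked = escapes_sandbox("../secrets.txt", box)
```

Returning a sandbox-relative path in the preview means the agent is only ever pointed at a location it is allowed to read.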
The 'Refinement vs. forward validation' pitfall examples in the synthesis system prompt named heat_level, the heat rule, jug-to-burner gating, and WaterBoiled — leaking the pybullet_boil latent's name and causal structure to the agent during model synthesis. Rewrite both using the generic widget/fixture/WidgetReady/process_value vocabulary already used elsewhere in the prompt, preserving the lessons unchanged.
During bilevel refinement the option model backtracks by resetting the PyBullet env to a search node's state. Features derived from a hidden sim-feature (e.g. bubbling_level read out from heat_level) cannot be reconstructed from an observation-only State, so they come back at their default (0). A learned rule that reads its own emitted observable back as input (a latch) then silently loses state, making otherwise-valid plans unrefinable — even though a continuous forward rollout works. PyBulletEnv._set_state now records the (object, feature) pairs it could not round-trip (_last_unreconstructible_features, via a structured _reconstruction_mismatch_features helper); it is cleared on sequential rollouts where no reset happens. The agent-sim combined simulators call a new _restore_unreconstructible_process_features that overwrites exactly those features (intersected with the declared PROCESS_FEATURES) with the carried value before the rules run. Scoping to the env-reported lossy set leaves base-reconstructible co-owned features (e.g. a robot-movable, wind-blown x,y) untouched, so this does not freeze them.
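The scoping logic of the restore step (lossy pairs reported by the env, intersected with the declared process features) can be shown on plain dictionaries. This is a sketch under assumptions: the function mirrors the name in the commit message but its signature and the state representation are invented for illustration.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, str]  # (object name, feature name)


def restore_unreconstructible_process_features(
    reset_state: Dict[Key, float],
    carried: Dict[Key, float],
    lossy_pairs: Set[Key],
    process_features: Set[str],
) -> Dict[Key, float]:
    """Overwrite only features that were lost in the reset AND are
    declared process features; everything else is left untouched."""
    restored = dict(reset_state)
    for obj, feat in lossy_pairs:
        if feat in process_features and (obj, feat) in carried:
            restored[(obj, feat)] = carried[(obj, feat)]
    return restored


# After a backtracking reset the derived observable comes back at 0.
reset_state = {("jug", "bubbling_level"): 0.0, ("ball", "x"): 0.4}
carried = {("jug", "bubbling_level"): 0.7, ("ball", "x"): 0.1}

restored = restore_unreconstructible_process_features(
    reset_state,
    carried,
    lossy_pairs={("jug", "bubbling_level")},
    process_features={"bubbling_level", "x"},
)
```

In the example, `x` is a declared process feature but was reconstructed faithfully, so it keeps the reset value; only the bubbling level is restored from the carried state.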
Tell the synthesis agent to keep any state carried across steps (counters, accumulated levels, irreversible "done" flags) in the threaded `latent` block, and to treat emitted observables as outputs only — recomputed from `latent` each step, never read back as input. Only `latent` is guaranteed to survive the planner's state resets during refinement, so a rule that latches on its own emitted feature passes a step-by-step rollout yet breaks at refinement time. Kept general (no env-specific names) and points at the existing Pattern A/B examples, which already follow it.
The agent_bilevel explorer previously refined with check_final_goal=False and reported "solved" purely from real-env execution, so a learned model that produces an executable plan but mispredicts the goal could trigger early stopping despite being unable to plan to the goal in its own model. Now the explorer refines with check_final_goal=True and records whether the mental model reached the task goal. refine_sketch's truncate_on_subgoal_fail additionally captures a final-goal failure (renamed deepest_subgoal_fail_* -> deepest_fail_*), so a goal the model predicts won't hold still runs end-to-end in reality as an experiment rather than being dropped. The verdict rides ToolContext to get_interaction_requests, which stamps InteractionRequest.mental_model_solved; main._generate_interaction_results treats a False verdict as not-solved for early stopping (None = no verdict, so other explorers are unchanged).
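The three-valued verdict boils down to a small decision rule. The function below is a hypothetical condensation of that rule, not the code in main._generate_interaction_results.

```python
from typing import Optional


def counts_as_solved(real_env_solved: bool,
                     mental_model_solved: Optional[bool]) -> bool:
    """Early-stopping rule: a False verdict vetoes a real-env success.

    None means the explorer gave no verdict, so the real-env result
    stands on its own (behaviour of other explorers is unchanged).
    """
    if mental_model_solved is False:
        return False
    return real_env_solved


vetoed = counts_as_solved(True, False)
no_verdict = counts_as_solved(True, None)
agreed = counts_as_solved(True, True)
```

Testing with `is False` rather than plain falsiness is what keeps None distinct from an explicit negative verdict.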
Replace the pybullet_boil/`heat_level` examples in the State.data and State.latent docstrings with environment-agnostic wording, matching the existing effort to keep core structs free of boil-specific leakage.
The switch envs define "fully on" as joint_scale * jointUpperLimit (~10% of the joint's URDF travel) but leave the prismatic joint free, so a gripper push can over-extend the slider into the remaining travel. From there the reverse push can no longer drag it back across the on/off threshold -- e.g. in boil, SwitchBurnerOn over-pushes the switch to frac~1.5 and the later SwitchBurnerOff then fails to turn it off, leaving BurnerOff unsatisfied. Forward-validation masked this because the switch is excluded from the observable state and reconstruction resets snap the joint back to the canonical on-position (frac=1.0), from which the off-push works. Add cap_switch_joint_travel (pybullet_helpers/objects.py): a changeDynamics upper limit at joint_scale * jointUpperLimit so "fully on" coincides with the joint's physical stop. changeDynamics is invisible to getJointInfo, so each env's frac readout (on=1.0 / off=0.0 / threshold=0.5) is unchanged -- only the unreachable over-extension headroom is removed. It is a no-op for switches that are only toggled programmatically. Applied at switch creation in boil, laser, switch, magic_bin, barrier, and fan (fan's setJointMotorControl2 drives the fan blades, not the switches).
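The arithmetic behind the over-extension can be worked through without PyBullet. The numbers and helper names below are assumptions chosen to reproduce the frac of roughly 1.5 mentioned above (a joint with 0.1 of URDF travel, joint_scale = 0.1, and a lower limit of 0); the real cap is applied to the simulator joint, not computed in Python like this.

```python
def switch_frac(joint_pos: float, joint_scale: float,
                joint_upper_limit: float) -> float:
    """Fraction of the 'fully on' position (on=1.0, off=0.0)."""
    return joint_pos / (joint_scale * joint_upper_limit)


def cap_joint_position(joint_pos: float, joint_scale: float,
                       joint_upper_limit: float) -> float:
    """Clamp to [0, on-position] so 'fully on' is the physical stop."""
    on_position = joint_scale * joint_upper_limit
    return min(max(joint_pos, 0.0), on_position)


JOINT_UPPER_LIMIT = 0.1  # hypothetical URDF travel
JOINT_SCALE = 0.1        # "fully on" is about 10% of that travel

over_pushed = 0.015  # a gripper push that overshoots the on-position
frac_uncapped = switch_frac(over_pushed, JOINT_SCALE, JOINT_UPPER_LIMIT)

capped = cap_joint_position(over_pushed, JOINT_SCALE, JOINT_UPPER_LIMIT)
frac_capped = switch_frac(capped, JOINT_SCALE, JOINT_UPPER_LIMIT)
```

Without the cap the slider sits at about 1.5 times the on-position, well past the 0.5 threshold that a reverse push has to cross; with the cap it sits exactly at the on-position, the same place a reconstruction reset would put it.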
Give every PyBullet env a "studio room" look -- muted floor, warm backdrop walls, wood table texture, a directional key light with contact shadows, and a neutral GUI background -- instead of the flat default scene. The backdrop room and key-light direction are derived from each env's camera, so the look adapts automatically; an env can override any piece via class vars or opt out with _use_studio_visuals = False. It is applied through the base PyBulletEnv (initialize_pybullet / render / __init__), so every env using the shared setup gets it; only domino needed its two-table initialize_pybullet updated (now via super()). The rendering machinery lives in a new pybullet_helpers/studio_visuals.py module, leaving the env classes with just the per-env-overridable studio config. Wall textures are generated by scripts/generate_room_textures.py.
Two CFG knobs let agent_planner run as a model-free or base-sim baseline against the world-model learner:
- agent_planner_use_simulator (default True): when False, the planner gets no option model, so test_option_plan and the scene-rendering tools (visualize_state/annotate_scene) are withheld and the prompt shifts to open-loop framing -- it must plan from trajectory data and LLM reasoning alone.
- agent_planner_use_base_simulator (default False): when a simulator is used, wraps the base env (skip_process_dynamics=True) instead of the real one, denying the delayed _domain_specific_step dynamics.
create_option_model gains a skip_process_dynamics passthrough (forwarded only when True, so non-PyBullet analog envs are unaffected). docker_agent_runner honors the base-sim flag on its in-container rebuild. agent_bilevel asserts a non-None option model. Defaults reproduce existing behavior.
docformatter 1.4 wanted re-wraps of the genericized latent docstrings in structs.py/utils.py. mypy flagged AgentAbstractionLearningApproach because AgentPlannerApproach now types _option_model as Optional (it genuinely can be None on the model-free path) while BilevelPlanningApproach types it non-Optional; suppress the unavoidable diamond-merge [misc] error.
1e042ec to 6166a81
Summary
Brings the partially-observable (PO) version of boil online end-to-end and makes the supporting machinery robust. The PO pipeline now solves 5/5 boil tasks across 5 seeds with the agent_po_sim_predicate_invention approach (see tag boil-po-solves-5-of-5).
This PR collects the work since #35:
- pybullet_boil is made partially observable; State gains latent/privileged blocks; the sim-learning approach threads latent through synthesis, predicate-quality eval, and refinement. New agent_po_sim_predicate_invention approach + PO ground-truth simulator.
- … pitfall examples (no boil leakage), 5-arg PO synthesis signature.
- … clamp, replacing an unenforced changeDynamics limit), geodesic EE orientation comparison in reconstruction diff, shared studio-room visuals, on-position joint cap.
- … goal verdict; agent_planner flags to deny/limit its planning simulator.
CI
All four checks pass locally (pytest, mypy incl. --platform linux, pylint, autoformat). Final fixes in this PR:
- … changeDynamics switch cap with a deterministic per-step clamp (PyBulletEnv.register_capped_switch_joint / _clamp_capped_switch_joints), fixing test_push_second_switch_boil_position_mode while keeping every env's on/off frac semantics unchanged.
- … structs.py/utils.py.
- … [misc] error on AgentAbstractionLearningApproach (_option_model is Optional on the agent path, non-Optional in BilevelPlanningApproach).